Website Analysis Integration
This document explains the website analysis service integration: how HTML is converted to markdown, how request metadata is processed, and how advanced scraping capabilities are implemented. It covers website validation mechanisms, content extraction patterns, and data transformation workflows, including the super scraper used for intelligent content extraction, DOM manipulation, and structured data processing. Practical examples illustrate analysis workflows, content processing patterns, and validation strategies, and the document closes with guidance on ethical scraping, rate limiting, performance optimization, and troubleshooting common issues.
The website analysis pipeline spans several layers:
Routers define HTTP endpoints for website analysis and validation.
Services orchestrate content fetching, conversion, and LLM-driven answer generation.
Tools implement HTML-to-markdown conversion, server-side markdown fetching, and a super scraper for advanced content extraction.
Prompts define the instruction templates and chains used to synthesize answers.
Configuration manages environment variables and logging.
```mermaid
graph TD
    subgraph "Routers"
        R1["POST /"]
        R2["website_validator.py<br/>POST /validate-website"]
    end
    subgraph "Services"
        S1["website_service.py<br/>WebsiteService"]
        S2["website_validator_service.py<br/>validate_website"]
    end
    subgraph "Tools"
        T1["website_context/__init__.py<br/>Exports"]
        T2["html_md.py<br/>return_html_md"]
        T3["request_md.py<br/>return_markdown"]
        T4["super_scraper.py<br/>clean_response"]
    end
    subgraph "Prompts"
        P1["website.py<br/>Prompt + Chain"]
        P2["prompt_injection_validator.py<br/>Validation Template"]
    end
    subgraph "Models"
        M1["requests/website.py<br/>WebsiteRequest"]
        M2["response/website.py<br/>WebsiteResponse"]
    end
    subgraph "Config"
        C1["core/config.py<br/>Environment & Logging"]
    end
    R1 --> S1
    R2 --> S2
    S1 --> T1
    T1 --> T2
    T1 --> T3
    S1 --> P1
    S2 --> P2
    S1 --> M1
    S1 --> M2
    S2 --> M1
    S2 --> M2
    S1 --> C1
    S2 --> C1
```
WebsiteService orchestrates the end-to-end website analysis:
Fetches server-side markdown via a Jina AI proxy.
Converts client-provided HTML to markdown.
Builds a prompt chain with server and client contexts plus optional chat history.
Optionally integrates an attached file via the Google GenAI SDK.
Returns a synthesized answer from the LLM.
WebsiteValidatorService validates HTML content by converting it to markdown and checking for prompt injection risks using a dedicated prompt and LLM.
Routers expose endpoints for website analysis and validation with request/response models.
Tools implement:
HTML-to-markdown conversion.
Server-side markdown fetching via Jina AI.
Super scraper for advanced content extraction with DOM filtering and asynchronous loading.
Prompts define the instruction templates and chains for answer synthesis and validation.
Configuration manages environment variables and logging.
The system follows a layered architecture:
HTTP layer: FastAPI routers accept requests and delegate to services.
Service layer: WebsiteService and WebsiteValidatorService encapsulate business logic.
Tool layer: Utilities for HTML/markdown conversion and content fetching.
Prompt layer: Instruction templates and chains for LLM interactions.
Configuration layer: Environment and logging setup.
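The layering above can be sketched in plain Python. This is an illustrative stand-in only: the class and function names here are hypothetical, FastAPI wiring and the real LLM calls are omitted, and the conversion logic is trivial on purpose.

```python
class MarkdownTools:
    """Tool layer: HTML/markdown conversion utilities (trivial stand-in)."""

    def html_to_md(self, html: str) -> str:
        # The real tool uses BeautifulSoup + html2text; this only strips <p> tags.
        return html.replace("<p>", "").replace("</p>", "\n").strip()


class WebsiteService:
    """Service layer: encapsulates business logic and depends on tools."""

    def __init__(self, tools: MarkdownTools):
        self.tools = tools

    def analyze(self, url: str, question: str, client_html: str = "") -> dict:
        client_md = self.tools.html_to_md(client_html) if client_html else ""
        return {"url": url, "question": question, "client_markdown": client_md}


def website_endpoint(payload: dict, service: WebsiteService) -> dict:
    """HTTP layer: a router-style handler that validates and delegates."""
    if not payload.get("url") or not payload.get("question"):
        return {"error": "url and question are required"}  # maps to HTTP 400
    return service.analyze(
        payload["url"], payload["question"], payload.get("client_html", "")
    )
```

The point is the direction of dependencies: the handler knows only the service, and the service knows only the tools, mirroring the layer diagram.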
WebsiteService#
WebsiteService coordinates:
Server-side markdown retrieval via Jina AI.
Client-side HTML-to-markdown conversion.
Chat history formatting.
Optional attached file processing via Google GenAI SDK.
LLM answer synthesis using a composed prompt chain.
Key processing logic:
Validates presence of required fields.
Fetches server markdown and logs length.
Converts client HTML to markdown when provided.
Formats chat history into a string.
Handles attached file upload and generation via Google GenAI if present.
Falls back to LLM-based answer synthesis otherwise.
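The control flow above can be sketched with the fetcher, converter, and LLM injected as callables. All names are hypothetical; the real service proxies through Jina AI, uses the composed prompt chain, and handles the Google GenAI file path separately.

```python
from typing import Callable, Optional


def analyze_website(
    url: str,
    question: str,
    chat_history: Optional[list[dict]] = None,
    client_html: Optional[str] = None,
    fetch_server_md: Callable[[str], str] = lambda u: "",
    html_to_md: Callable[[str], str] = lambda h: h,
    llm: Callable[[str], str] = lambda prompt: "",
) -> str:
    # 1. Validate presence of required fields.
    if not url or not question:
        raise ValueError("url and question are required")
    # 2. Fetch server-side markdown (the real service uses a Jina AI proxy).
    server_md = fetch_server_md(url)
    # 3. Convert client HTML to markdown when provided.
    client_md = html_to_md(client_html) if client_html else ""
    # 4. Format chat history into a string.
    history = "\n".join(
        f"{m['role']}: {m['content']}" for m in (chat_history or [])
    )
    # 5. Compose both contexts into a prompt and synthesize the answer.
    prompt = (
        f"Server context:\n{server_md}\n\nClient context:\n{client_md}\n\n"
        f"History:\n{history}\n\nQuestion: {question}"
    )
    return llm(prompt)
```

Injecting the external dependencies keeps the orchestration logic testable without network access or model calls.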
WebsiteValidatorService#
WebsiteValidatorService performs:
HTML-to-markdown conversion.
Prompt injection risk assessment using a dedicated prompt template and LLM.
Boolean safety determination based on model output.
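A minimal sketch of that flow, with the converter and classifier injected (the real service runs a dedicated prompt template through the LLM; the accepted verdict tokens here are assumptions):

```python
from typing import Callable


def validate_website(
    html: str,
    html_to_md: Callable[[str], str],
    classify: Callable[[str], str],
) -> bool:
    """Return True when the content is judged safe to pass to the main LLM."""
    markdown = html_to_md(html)
    # The classifier is expected to answer with a true/false-style token.
    # Parsing is deliberately conservative: anything unrecognized is unsafe.
    verdict = classify(markdown).strip().lower()
    return verdict in ("true", "safe", "yes")
```

Defaulting to "unsafe" on unexpected model output is the safer failure mode for an injection check.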
Tools: HTML to Markdown and Server-Side Fetching#
HTML-to-Markdown converter uses BeautifulSoup and html2text to normalize and convert HTML bodies to markdown.
Server-side markdown fetcher uses a Jina AI proxy to retrieve clean markdown from URLs.
Super scraper leverages WebBaseLoader with BeautifulSoup filters and asynchronous loading to extract structured content.
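The conversion step can be illustrated with the standard library alone. This is a deliberately tiny sketch handling three tags; the actual tool uses BeautifulSoup and html2text, which cover far more markup.

```python
from html.parser import HTMLParser


class TinyMarkdown(HTMLParser):
    """Stdlib-only sketch of HTML-to-markdown conversion."""

    def __init__(self):
        super().__init__()
        self.out: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.out.append("# ")
        elif tag == "li":
            self.out.append("- ")

    def handle_endtag(self, tag):
        if tag in ("h1", "p", "li"):
            self.out.append("\n")

    def handle_data(self, data):
        self.out.append(data)


def html_to_md(html: str) -> str:
    parser = TinyMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()


print(html_to_md("<h1>Title</h1><p>Body text</p><li>item</li>"))
```

The server-side fetcher, by contrast, retrieves already-clean markdown: Jina's reader proxy works by prefixing the target URL (e.g. `https://r.jina.ai/https://example.com`), so no local parsing is needed for that path.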
Prompts and Chains#
The website prompt defines a two-context synthesis strategy: server-fetched markdown and client-rendered markdown, with guidelines for summaries, structure, links/media, code, metadata, data analysis, and formatting.
A runnable chain composes the prompt with the LLM client and an output parser.
The validator prompt checks for prompt injection attempts and returns a boolean safety signal.
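A stripped-down sketch of the two-context composition. The real implementation composes a LangChain prompt, the LLM client, and an output parser into a runnable chain; the template text below is illustrative, not the production prompt.

```python
WEBSITE_PROMPT = """You are analyzing a web page.

Server-fetched markdown:
{server_context}

Client-rendered markdown:
{client_context}

Chat history:
{history}

Question: {question}

Follow the guidelines for summaries, structure, links/media, code,
metadata, data analysis, and formatting when answering."""


def build_prompt(server_context: str, client_context: str,
                 history: str, question: str) -> str:
    """Fill the two-context template; a chain would pipe this into the LLM."""
    return WEBSITE_PROMPT.format(
        server_context=server_context,
        client_context=client_context,
        history=history,
        question=question,
    )
```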
Request/Response Models#
WebsiteRequest includes URL, question, optional chat history, optional client HTML, and optional attached file path.
WebsiteResponse wraps the generated answer.
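The field shapes can be sketched with dataclasses. The real models are Pydantic, and the exact field names below are inferred from the description, not taken from the source.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WebsiteRequest:
    url: str
    question: str
    chat_history: list = field(default_factory=list)   # optional prior turns
    html: Optional[str] = None                         # optional client-rendered HTML
    attached_file_path: Optional[str] = None           # optional file for GenAI


@dataclass
class WebsiteResponse:
    answer: str
```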
MCP Server Integration#
The MCP server exposes tools for website analysis:
website.fetch_markdown: Fetches markdown content for a given URL via a Jina proxy.
website.html_to_md: Converts raw HTML to markdown.
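A registry-style sketch of how such tools might be exposed. The tool names come from the list above; the decorator-based registration is a stand-in for the actual MCP server API, and the tool bodies are trivial placeholders.

```python
from typing import Callable, Dict

TOOLS: Dict[str, Callable[[str], str]] = {}


def tool(name: str):
    """Register a callable under an MCP-style tool name (hypothetical API)."""
    def register(fn: Callable[[str], str]) -> Callable[[str], str]:
        TOOLS[name] = fn
        return fn
    return register


@tool("website.fetch_markdown")
def fetch_markdown(url: str) -> str:
    # The real tool fetches through the Jina proxy; here we only build the URL.
    return f"https://r.jina.ai/{url}"


@tool("website.html_to_md")
def html_to_md(html: str) -> str:
    # Placeholder conversion; the real tool uses BeautifulSoup + html2text.
    return html.replace("<p>", "").replace("</p>", "\n").strip()
```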
WebsiteService depends on:
Tools for markdown fetching and HTML conversion.
Prompts for constructing the answer chain.
Configuration for logging.
WebsiteValidatorService depends on:
Tools for HTML-to-markdown conversion.
Validator prompt and LLM for safety assessment.
Routers depend on:
Models for request/response validation.
Services for business logic.
Asynchronous loading: The super scraper uses asynchronous document loading to improve throughput when fetching multiple pages.
Selective parsing: BeautifulSoup filters limit parsing to relevant DOM sections, reducing overhead.
Caching and reuse: Reuse server-fetched markdown and client-provided markdown to avoid redundant conversions.
Rate limiting and retries: Integrate retry logic and backoff when calling external services like Jina AI and Google GenAI.
Timeout configuration: Set explicit timeouts for network requests to prevent long blocking operations.
Chunking and pagination: For very large pages, consider chunking content before passing to the LLM to manage token limits.
Environment tuning: Adjust logging levels and environment variables for production deployments to minimize overhead.
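For the chunking recommendation, a simple character-budget splitter illustrates the idea; in practice a token-aware splitter (e.g. LangChain's text splitters) is preferable, and the budget values here are arbitrary.

```python
def chunk_text(text: str, max_chars: int = 4000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks that fit a rough size budget.

    Overlap preserves a little context across chunk boundaries so the LLM
    does not lose sentences cut in half.
    """
    if max_chars <= overlap:
        raise ValueError("max_chars must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks
```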
Common issues and resolutions:
HTTP 400/500 errors from website router:
Ensure URL and question are provided in the request payload.
Check service logs for detailed error messages.
Empty or malformed markdown:
Verify the URL resolves correctly and returns HTML.
Confirm client HTML is well-formed when passed for conversion.
Prompt injection validation failures:
Review the validator response and sanitize HTML accordingly.
Consider additional sanitization steps before conversion.
Google GenAI file processing errors:
Confirm API keys are configured and accessible.
Validate the file path and permissions.
Network timeouts or rate limits:
Add retry logic with exponential backoff.
Monitor external service availability and adjust timeouts.
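The retry-with-backoff advice can be sketched as a small decorator. Delays are shortened here for illustration; tune `retries` and `base_delay` for real calls to Jina AI or Google GenAI.

```python
import time
from functools import wraps


def with_backoff(retries: int = 3, base_delay: float = 0.1):
    """Retry a flaky call, doubling the delay after each failure."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == retries:
                        raise  # out of attempts: surface the original error
                    time.sleep(delay)
                    delay *= 2
        return wrapper
    return decorate


calls = {"n": 0}


@with_backoff(retries=3, base_delay=0.01)
def flaky_fetch() -> str:
    """Simulated external call that fails twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated network timeout")
    return "ok"


print(flaky_fetch())  # succeeds on the third attempt
```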
The website analysis integration combines robust content fetching, intelligent HTML-to-markdown conversion, and LLM-driven synthesis to deliver accurate answers from web pages. Validation ensures safety against prompt injection, while advanced scraping tools enable structured content extraction. By following the outlined workflows, patterns, and best practices, teams can deploy reliable, ethical, and high-performance web analysis capabilities.
Example Workflows#
Basic website analysis:
Client posts WebsiteRequest to the website router.
Service fetches server markdown, optionally converts client HTML, builds the prompt chain, and returns an answer.
Website validation:
Client posts WebsiteValidatorRequest to the validator router.
Service converts HTML to markdown and runs the validator prompt; returns a safety decision.
Super scraper usage:
Invoke the super scraper to asynchronously load and filter content from a URL, returning a structured document for downstream processing.
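Concretely, a basic analysis request might be posted as JSON shaped like this. The field names follow the WebsiteRequest description earlier; the endpoint path and `requests` call in the comment are illustrative, not taken from the source.

```python
import json

# Hypothetical request body for the website analysis endpoint.
payload = {
    "url": "https://example.com/article",
    "question": "Summarize the key points of this page.",
    "chat_history": [
        {"role": "user", "content": "What site is this?"},
        {"role": "assistant", "content": "example.com"},
    ],
    "html": "<html><body><p>Rendered content</p></body></html>",
}
body = json.dumps(payload)

# A client would then send it, e.g.:
#   requests.post(f"{BASE_URL}/", data=body,
#                 headers={"Content-Type": "application/json"})
```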